TensorFlow - Recurrent Neural Network Example

About this notebook

Author: Philip Reschke (http://www.philipreschke.com).

Project: https://github.com/PhilipReschke/TensorFlow-Code-Examples

I will build a Recurrent Neural Network using a number of LSTM layers to predict whether a movie review is positive or negative. I will be using an embedding layer instead of one-hot encoding all my inputs, as that is computationally inefficient when we have 70,000+ words. This is a modified version of the RNN sentiment example that is part of the Udacity Deep Learning Foundation class.

Requirements

  • Python 3.5
  • TensorFlow 1.1

Import dependencies


In [1]:
import tensorflow as tf
import numpy as np

Import data

As for the raw movie reviews and their positive/negative classification, I will be using the 25,000 reviews available on the Udacity Deep Learning GitHub page at https://github.com/udacity/deep-learning/blob/master/sentiment_network/.


In [2]:
with open('data/reviews.txt', 'r') as raw_reviews:
    reviews = raw_reviews.read()
with open('data/labels.txt', 'r') as raw_labels:
    labels = raw_labels.read()

Our reviews data looks like this:


In [3]:
reviews[:1000]


Out[3]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn'

As we can see from the example above, the data is not very tidy: it contains punctuation and line breaks such as '\n', which actually separate one review from the next. Using a few tidying rules, I will clean up the raw text so that we have less noise when training our model.
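
As a quick sanity check (just a sketch; it assumes, per the dataset description, that reviews and labels are newline-separated), we can count the newline characters to confirm we have roughly 25,000 entries:


In [ ]:
# A trailing newline adds one empty "review", which we remove further down.
print(reviews.count('\n'), labels.count('\n'))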

Data cleansing

Before we can build an RNN using TensorFlow, we need to get the data into a proper shape. As I will be using an embedding layer, I need to convert each word into an integer. I will also be doing a bit of data tidying.


In [4]:
# Remove periods and the 'br' line break tags
reviews_complete_text = ''.join([char for char in reviews if char != '.'])
reviews_complete_text = reviews_complete_text.replace('  br    br  ', '')

# Splitting the reviews string into a list of reviews
reviews_list = reviews_complete_text.split('\n')

The reviews are now separated from each other and stored in separate list entries. Here are the first two reviews:


In [5]:
reviews_list[0:2]


Out[5]:
['bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   ',
 'story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly   ']

We will also need to build up a vocabulary of all the words in our reviews dataset, so let's do that now:


In [6]:
text_in_reviews = ''.join(reviews_list)
words_in_reviews = text_in_reviews.split()

In [7]:
words_in_reviews[:10]


Out[7]:
['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

Encoding words

To implement an embedding layer, we must pass integers to our network. To prepare our reviews for this, we will create a dictionary that maps each word in our vocabulary to an integer, starting at 1 so that 0 remains free for padding later on. Using that, we can convert our reviews into integers.


In [8]:
from collections import Counter

# Create word counter and sort by the number of occurrences of each word in descending order
word_counts = Counter(words_in_reviews)
vocabulary = sorted(word_counts, key=word_counts.get, reverse=True)

# Create a word-to-integer dictionary (starting at 1 so that 0 stays free for padding)
vocabulary_to_int = {word: i for i, word in enumerate(vocabulary, 1)}

# Convert each review into a list of word integers
reviews_int = []
for each in reviews_list:
    reviews_int.append([vocabulary_to_int[word] for word in each.split()])

Our reviews are now encoded as integers. Here are the first two reviews, both as integers and as raw text. Nice!


In [9]:
np.array(reviews_int)[0:2], reviews_list[0:2]


Out[9]:
(array([ [22014, 307, 6, 3, 1049, 206, 7, 2138, 31, 1, 170, 56, 14, 48, 80, 5836, 43, 381, 109, 139, 14, 5219, 59, 153, 8, 1, 4999, 5861, 474, 70, 5, 259, 11, 22014, 307, 12, 1977, 6, 73, 2405, 5, 613, 72, 6, 5219, 1, 25790, 5, 1988, 10388, 1, 5842, 1503, 35, 50, 65, 203, 144, 66, 1201, 5219, 20494, 1, 44440, 4, 1, 220, 882, 30, 2996, 70, 4, 1, 5832, 9, 685, 2, 66, 1503, 53, 9, 215, 1, 382, 8, 61, 3, 1405, 3699, 782, 5, 3487, 179, 1, 381, 9, 1211, 13603, 31, 307, 3, 348, 340, 2922, 9, 142, 126, 5, 7803, 29, 4, 128, 5219, 1405, 2332, 5, 22014, 307, 9, 527, 11, 108, 1447, 4, 59, 542, 101, 11, 22014, 307, 6, 226, 4176, 47, 3, 2211, 11, 7, 214, 22],
        [62, 4, 3, 124, 35, 46, 7571, 1395, 15, 3, 4200, 504, 44, 16, 3, 621, 133, 11, 6, 3, 1278, 456, 4, 1721, 206, 3, 10663, 7376, 299, 6, 666, 82, 34, 2115, 1087, 3004, 33, 1, 898, 61494, 4, 7, 12, 5141, 463, 7, 2663, 1721, 1, 220, 56, 16, 57, 793, 1296, 833, 227, 7, 42, 97, 122, 1469, 58, 146, 37, 1, 962, 141, 28, 666, 122, 1, 13638, 409, 60, 94, 1777, 305, 755, 5, 3, 818, 10596, 21, 3, 1725, 634, 7, 12, 127, 72, 20, 232, 101, 16, 48, 49, 617, 33, 684, 84, 29994, 30691, 684, 373, 3348, 11468, 2, 16708, 8028, 50, 28, 107, 3339]], dtype=object),
 ['bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   ',
  'story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly   '])
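
As a quick round-trip check (a small sketch, not part of the original pipeline), we can build the inverse mapping and decode the start of the first encoded review back into words:


In [ ]:
# Invert the word-to-integer dictionary and decode the first ten word ids.
int_to_vocabulary = {i: word for word, i in vocabulary_to_int.items()}
print(' '.join(int_to_vocabulary[i] for i in reviews_int[0][:10]))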

Encoding labels

We will also have to convert our labels into integers so we can predict against them. Let's do that now.


In [10]:
labels_split = labels.split('\n')
labels = np.array([1 if label == 'positive' else 0 for label in labels_split])

Our labels are now encoded as 1 and 0. See here:


In [11]:
np.array(labels_split[0:20]), labels[0:20]


Out[11]:
(array(['positive', 'negative', 'positive', 'negative', 'positive',
        'negative', 'positive', 'negative', 'positive', 'negative',
        'positive', 'negative', 'positive', 'negative', 'positive',
        'negative', 'positive', 'negative', 'positive', 'negative'], 
       dtype='<U8'),
 array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]))

Truncating and padding reviews

To help our RNN learn, we want to ensure that none of the reviews are too long, and we also want to make sure there are no reviews of zero length - just in case we are working with bad data, which we often are; such is the life of a data scientist.


In [12]:
# Review length counter
review_length = Counter([len(review) for review in reviews_int])

# Reviews with 0 length and longest review
review_length[0], max(review_length)


Out[12]:
(1, 2494)
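
Before committing to a fixed length below, it can also help to glance at the overall length distribution (an optional sketch; the printed values are whatever the data gives):


In [ ]:
# Minimum, mean and maximum review length in words
lengths = np.array([len(review) for review in reviews_int])
print(lengths.min(), int(lengths.mean()), lengths.max())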

Let's first remove the zero-length review by creating an index of the non-empty reviews:


In [13]:
# Index of reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review) != 0]

In [14]:
# Remove zero length reviews from reviews and labels
reviews_int = [reviews_int[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])

# Check labels and reviews length
len(reviews_int), len(labels)


Out[14]:
(25000, 25000)

Now that the empty review is gone, let's reduce each review to a fixed length of 250 words, padding with 0 on the left where reviews are shorter than 250 words.

While we are at it, we might as well create our input array, which will be N by M, where N is the number of reviews in our dataset and M is the desired review length. Each row is then one review of exactly that length.


In [15]:
# Desired review length - we cut off anything beyond it
seq_len = 250

# Creating a zero matrix of dimensions N by M
features = np.zeros((len(reviews_int), seq_len), dtype=int)

# Fill in the reviews, right-aligned so that the padding zeros sit on the left
for i, row in enumerate(reviews_int):
    features[i, -len(row):] = np.array(row)[:seq_len]

We now have our features ready for training. Here are the first two rows (excluding the last 100 columns):


In [16]:
features[:2,:-100]


Out[16]:
array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0, 22014,   307,     6,     3,  1049,   206,     7,
         2138,    31,     1,   170,    56,    14,    48,    80,  5836,
           43,   381,   109,   139,    14,  5219,    59,   153,     8,
            1,  4999,  5861,   474,    70,     5,   259,    11, 22014,
          307,    12,  1977,     6,    73,  2405],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,    62,     4,     3,   124,    35,    46,  7571,  1395,
           15,     3,  4200,   504,    44,    16]])

In [17]:
features.shape


Out[17]:
(25000, 250)
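
A quick spot check (a small sketch, assuming the first review is shorter than seq_len) confirms that each review sits right-aligned in its row with zero padding in front:


In [ ]:
first = reviews_int[0]
# The review occupies the right end of the row...
assert (features[0, -len(first):] == np.array(first)).all()
# ...and everything before it is padding
assert (features[0, :seq_len - len(first)] == 0).all()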

Training, validation and testing datasets

My reviews and labels data is now cleaned up as much as is required to demonstrate how an RNN works with TensorFlow, so it's time to split it into a training, validation and testing set. Let's set aside 80% for training and 10% each for validation and testing.


In [18]:
# Split fraction
training_fraction = 0.8

# Index at which to split the dataset into training and validation
training_idx = int(len(features) * training_fraction)

# Splitting the dataset for train and val
train_x, val_x = features[:training_idx], features[training_idx:]
train_y, val_y = labels[:training_idx], labels[training_idx:]

# Index to split the dataset at for validation and testing
validation_idx = int(len(val_x) * 0.5)

# Splitting the dataset for val and testing
val_x, test_x = val_x[:validation_idx], val_x[validation_idx:]
val_y, test_y = val_y[:validation_idx], val_y[validation_idx:]

Our dataset is now split into the following training, validation and testing sets:


In [19]:
print("Training: \t\t{}".format(train_x.shape), 
      "\nValidation: \t\t{}".format(val_x.shape),
      "\nTesting: \t\t{}".format(test_x.shape))


Training: 		(20000, 250) 
Validation: 		(2500, 250) 
Testing: 		(2500, 250)
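
The labels we printed earlier alternate between positive and negative, so this simple ordered split should keep the classes roughly balanced; a quick check of the positive fraction in each split (another optional sketch) makes that explicit:


In [ ]:
# Fraction of positive labels in each split; values near 0.5 indicate balance
print(train_y.mean(), val_y.mean(), test_y.mean())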

TensorFlow RNN graph

Hyperparameters

For use in our graph, we will need the following four hyperparameters:


In [20]:
lstm_size = 512       # Number of units in the hidden state of each LSTM cell
lstm_layers = 2       # Number of LSTM layers in the network
batch_size = 500      # Number of reviews, out of the 20,000 in our training set, that we feed into the network in one go
learning_rate = 0.005 # Our learning rate for use in the Adam optimizer

Graph placeholders

Let's define our graph placeholders for inputs, labels and our keep probability for use with our dropout wrapper.


In [21]:
n_words = len(vocabulary_to_int)

# Create the graph object
graph = tf.Graph()

# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

In [22]:
n_words


Out[22]:
74072

Embedding layer

I'll now create an embedding layer. We need this because there are 74,072 words in our vocabulary, and it would be massively inefficient to one-hot encode that many input words.


In [23]:
# Number of units in the embedding layer
embed_size = 300 

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
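
To see why the lookup is preferable to one-hot encoding, here is a tiny NumPy illustration (not part of the graph): an embedding lookup simply selects a row of the embedding matrix, which gives the same result as multiplying a one-hot vector by that matrix, without ever materializing the mostly-zero vectors.


In [ ]:
# Toy example with a 5-word vocabulary and 3-dimensional embeddings
toy_vocab, toy_dim = 5, 3
toy_embedding = np.random.uniform(-1, 1, (toy_vocab, toy_dim))

word_id = 2
one_hot = np.eye(toy_vocab)[word_id]  # length-5 vector with a single 1
assert np.allclose(one_hot @ toy_embedding, toy_embedding[word_id])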

LSTM Cell

Using the already defined hyperparameters, I will define my LSTM cells here.


In [24]:
with graph.as_default():
    def build_cell():
        # A basic LSTM cell with dropout applied to its output
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

    # Stack up multiple LSTM layers, for deep learning. Each layer gets its own
    # cell instance (reusing one instance via [cell] * lstm_layers breaks in
    # TensorFlow 1.2+ and is best avoided).
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

RNN forward pass

We are now ready to pass data forward through the RNN, which is done using tf.nn.dynamic_rnn.


In [25]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)

Output

As we only care about the final output of the LSTM cells, we grab it using outputs[:, -1] and pass it through a fully connected layer with a sigmoid activation function.


In [26]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Validation accuracy

Standard code for calculating accuracy.


In [27]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Batching

We will be using a simple function to return batches from our review data. First it drops any leftover data so that we only have full batches. Then it iterates through the x and y arrays and yields slices of size batch_size out of those arrays.


In [28]:
def get_batches(x, y, batch_size=100):
    
    # Calculate number of batches
    n_batches = len(x)//batch_size
    
    # Obtain only full batches
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
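
A quick usage check (optional) shows the shapes the generator yields for our training data:


In [ ]:
# The first training batch: batch_size rows of seq_len word ids plus one label per row
sample_x, sample_y = next(get_batches(train_x, train_y, batch_size))
print(sample_x.shape, sample_y.shape)  # expecting (500, 250) and (500,)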

Training


In [29]:
# Create the checkpoints directory if it doesn't already exist
!mkdir -p checkpoints

In [30]:
epochs = 5

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")


Epoch: 0/5 Iteration: 5 Train loss: 0.413
Epoch: 0/5 Iteration: 10 Train loss: 0.257
Epoch: 0/5 Iteration: 15 Train loss: 0.237
Epoch: 0/5 Iteration: 20 Train loss: 0.221
Epoch: 0/5 Iteration: 25 Train loss: 0.245
Val acc: 0.499
Epoch: 0/5 Iteration: 30 Train loss: 0.235
Epoch: 0/5 Iteration: 35 Train loss: 0.228
Epoch: 0/5 Iteration: 40 Train loss: 0.218
Epoch: 1/5 Iteration: 45 Train loss: 0.277
Epoch: 1/5 Iteration: 50 Train loss: 0.234
Val acc: 0.577
Epoch: 1/5 Iteration: 55 Train loss: 0.216
Epoch: 1/5 Iteration: 60 Train loss: 0.173
Epoch: 1/5 Iteration: 65 Train loss: 0.165
Epoch: 1/5 Iteration: 70 Train loss: 0.139
Epoch: 1/5 Iteration: 75 Train loss: 0.164
Val acc: 0.798
Epoch: 1/5 Iteration: 80 Train loss: 0.148
Epoch: 2/5 Iteration: 85 Train loss: 0.141
Epoch: 2/5 Iteration: 90 Train loss: 0.127
Epoch: 2/5 Iteration: 95 Train loss: 0.109
Epoch: 2/5 Iteration: 100 Train loss: 0.063
Val acc: 0.872
Epoch: 2/5 Iteration: 105 Train loss: 0.013
Epoch: 2/5 Iteration: 110 Train loss: 0.004
Epoch: 2/5 Iteration: 115 Train loss: 0.006
Epoch: 2/5 Iteration: 120 Train loss: 0.006
Epoch: 3/5 Iteration: 125 Train loss: 0.183
Val acc: 0.853
Epoch: 3/5 Iteration: 130 Train loss: 0.110
Epoch: 3/5 Iteration: 135 Train loss: 0.069
Epoch: 3/5 Iteration: 140 Train loss: 0.033
Epoch: 3/5 Iteration: 145 Train loss: 0.008
Epoch: 3/5 Iteration: 150 Train loss: 0.001
Val acc: 0.601
Epoch: 3/5 Iteration: 155 Train loss: 0.000
Epoch: 3/5 Iteration: 160 Train loss: 0.000
Epoch: 4/5 Iteration: 165 Train loss: 0.310
Epoch: 4/5 Iteration: 170 Train loss: 0.163
Epoch: 4/5 Iteration: 175 Train loss: 0.119
Val acc: 0.733
Epoch: 4/5 Iteration: 180 Train loss: 0.065
Epoch: 4/5 Iteration: 185 Train loss: 0.039
Epoch: 4/5 Iteration: 190 Train loss: 0.033
Epoch: 4/5 Iteration: 195 Train loss: 0.007
Epoch: 4/5 Iteration: 200 Train loss: 0.004
Val acc: 0.704

Testing


In [31]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))


Test accuracy: 0.730
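
As a final illustration, here is a hypothetical helper (a sketch only, using made-up review text) that scores a single new review with the trained graph. Because initial_state was built with a fixed batch_size, the simplest workaround is to tile the one encoded review into a full batch and read off the first prediction; words that are not in the vocabulary fall back to the padding id 0.


In [ ]:
def predict_sentiment(review_text, sess):
    # Encode the raw text the same way the training data was encoded
    words = review_text.lower().replace('.', ' ').split()
    ints = [vocabulary_to_int.get(word, 0) for word in words][:seq_len]
    row = np.zeros(seq_len, dtype=int)
    row[-len(ints):] = ints
    # Tile the single review into a full batch to match the fixed state size
    batch = np.tile(row, (batch_size, 1))
    feed = {inputs_: batch,
            keep_prob: 1,
            initial_state: sess.run(cell.zero_state(batch_size, tf.float32))}
    return sess.run(predictions, feed_dict=feed)[0][0]

with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    print(predict_sentiment('a terrific movie that i would gladly watch again', sess))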